Stochastic consolidation of lifelong memory

Humans have the remarkable ability to continually store new memories, while maintaining old memories for a lifetime. How the brain avoids catastrophic forgetting of memories due to interference between encoded memories is an open problem in computational neuroscience. Here we present a model for continual learning in a recurrent neural network combining Hebbian learning, synaptic decay and a novel memory consolidation mechanism: memories undergo stochastic rehearsals with rates proportional to the memory’s basin of attraction, causing self-amplified consolidation. This mechanism gives rise to memory lifetimes that extend much longer than the synaptic decay time, and retrieval probability of memories that gracefully decays with their age. The number of retrievable memories is proportional to a power of the number of neurons. Perturbations to the circuit model cause temporally-graded retrograde and anterograde deficits, mimicking observed memory impairments following neurological trauma.

We use a mean field theory to approximate the basin of attraction of a memory pattern as a function of the memory efficacy A l (t) and of the global interference noise amplitude ∆(t) = f /N k A 2 k (t) (Fig. S1, see methods in the main text). To check the validity of our approximation, we ran full network simulations, where random activation patterns (dimension N , sparseness f ) were generated and stored in the network via the synaptic dynamics (Eq. (1) in the main text), where we used the basin of attraction approximated by the mean field calculation. Then, for a given memory, we generated a perturbed initial condition in the following way: k active units (0 ≤ k ≤ f N ) were chosen randomly and flipped to inactive state, and k inactive units in the memory state were chosen randomly and flipped to active state (so that the total activation of f N is maintained). Then, we initialized the network dynamics at that state, ran it until convergence, and measured the overlap between the final state and the original memory state. We kept increasing k until the final overlap dropped beneath 0.85 (or k reached f N ). We repeated this process 50 times, in each we chose randomly the units to be flipped. We define the basin of attraction size as k max /f N , where k max is the average (over trials) of the maximal k for achieving final overlap of 0.85. In Fig. S2 we plot the basin of attraction as a function of the efficacy A for both the full simulation calculation and for the mean field approximation for several values of N . As one can see, the prediction is good, and especially the prediction of the critical efficacy, where the basin size vanishes. As shown in Fig. (3) in the main text, the forgetting curves obtained from the full network simulations and the mean field approximation fits perfectly, regardless of the small deviations seen in the basins size. Assuming f t and large overlap (M ≈ 1, and therefore f − 1, 1−f + 1), equations (31) and (32) in the main text can be approximated by: Now, using the fixed activation assumption: we obtain an expression for the threshold θ: Demanding that the efficacy is larger than θ, we obtain: and therefore, a(f ) = 2 log(1/f ).
A similar scaling with f for the memory capacity was obtained by Tsodyks and Feigel'man (ref. [37] in the main text) for a static model (no synaptic decay and no rehearsals) using a full mean field calculation. Interestingly, it was shown by Tsodyks (ref. [38] in the main text) that also for sparsely connected networks with sparse activation, a similar scaling of the capacity holds.
By solving numerically the full fixed point equation in the main text (36) without taking M → 1 , we found that the approximated expression gives the right functional behavior, but required a constant numerical correction ( as a function of f .

C. Fixed threshold
When assuming fixed threshold instead of fixed activation level, the basin calculation is modified: instead of equation (25) in the main text we have an equation for the mean activation level: By a similar calculation as in the Methods section, using equations (31), (32) and (27) in the main text together with eq. (S11) we can write the overlap and activation dynamics as: and For a given threshold θ the basin of attraction size is a function of two variables, memory efficacy A l and the interference noise ∆ 2 = 1 N l A 2 l . We estimate the basin size for given values of A l , ∆ and θ in a similar way to the fixed activation scenario (Methods): First, we run the dynamics described by equation (S12) from initial condition M (0) = 1, repeat the process with smaller M (0) value until either M f is smaller than 0.85 or M (0) is smaller than 0.01. We define the basin size as the distance between the unperturbed overlap (starting at M (0) = 1) and the minimal M (0) for which M f > 0.85 (or zero if we reach M (0) < 0.01). Results are shown in Fig. S4. The main qualitative differences are that A c does not go to zero for ∆ = 0 , because even in the absence of interference, the fixed threshold puts a lower bound on the value of the local fields that will allow activation of neurons. As a result, A c is finite at low ∆ and increases very slowly with it. However, above a critical value of ∆ fluctuations are strong enough to allow neurons that are off in the memory state to turn on, effectively making all memories unstable (similar to the catastrophic forgetting in the classical sparse Hopfield model (Tsodyks and Feigel'man, ref.
[37] in the main text). The system's capacity depends now on the value of the threshold (relative to the encoding strength A(0) = 1). On one hand, increasing the threshold size increases the minimal efficacy required for activation, hence A c increases; on the other hand it increases the critical ∆ value at which all memories become unstable. Therefore we should expect an optimal threshold value for maximizing the capacity (Tsodyks and Feigel'man, ref. [37] in the main text). Indeed, for a given b, λτ we calculate the capacity for different threshold values and show that there is an optimal value (Fig. S6b), where capacity is similar to the adaptive threshold (constant activation) setting.
Memory efficacies dynamics and forgetting curve shapes are qualitatively similar to the constant activation setting (Fig. S6a).
When modeling memory deficits due to perturbations, we used the fixed threshold scenario in order to account for potential memory loss due to disruption of the level of activity in the network; specifically, allowing for memory loss due to insufficient activation (in the case of synaptic dilution) or due to over-activation (in the case of noisy synaptic dynamics).
The fixed threshold was chosen as the one that maximizes the capacity in the control case (before perturbations). For the chosen parameters τ /(2N ) = 0.01, λτ = 10, b = 0.2, the optimal threshold is θ = 0.36 (Fig. S6b). In the threshold adaptation paragraph we adapted the threshold to a value maximizing capacity in the presence of the perturbation.

II. EFFICACIES FLUCTUATIONS AROUND THE FIXED POINT A f p
At the steady state (after ∆ had saturated), after a memory gets consolidated, its efficacy fluctuates around the fixed point A f p until a large enough fluctuation drives it beneath A c , and then it decays exponentially to zero. The stochastic rehearsals driving the random walk around A f p can be approximated by a Poisson process with rate: where F (A f p /∆) is the basin of attraction size function evaluated at A f p and the steady state ∆. Now, n rehearsal events after consolidation, the value of the efficacy can be written as: where t i is the time of rehearsal event i. Let us define ∆t i ≡ t i − t and the total time since Now, define which are independent random variables (∆t i are independent) , and A(t) = b i y i . The probability density of y i : We used the fact that ∆t i are uniformly distributed on T . Now, using the condition for a fixed point, Now we can calculate the mean and variance of A(t) The last limit assumes large number of rehearsals n.

III. THE EFFECTIVE FORGETTING TIME, CONSOLIDATION PROBABILITY AND AND CAPACITY
After A c saturates, the behavior of every memory efficacy is a random walk with an attraction to A f p . This process will continue until the first time the efficacy will cross A c , and from that moment the memory is forgotten, and will decay exponentially. The forgetting curve is essentially the distribution of first passage times for this stochastic process. For Gaussian noise it is known to be an approximate exponential distribution, with mean first passage (forgetting) time: Where σ 2 is the variance of the driving Gaussian process. In our case the efficacy dynamics is driven by non-homogeneous Poisson noise, and there is no closed form solution for the mean first passage time from A f p to A c we are aware of. To get an approximated and insightful expression for the mean life time of consolidated memories, we work as follows: First, in the absence of rehearsals, a memory will drop from A f p to A c over τ log A f p /A c time. In a homogeneous Poisson process with rate λ, the chance of not having events for that time is exp (−λτ log (A f p /A c )) = (A c /A f p ) λτ . However, in our model the rehearsal rate is not homogeneous. Therefore, we estimate the initial point and the rehearsal rate at an intermediate point, which we find empirically to give a good fit to the data. We then approximate the mean forgetting time as τ times the inverse of the probability function: We found empirically that the first passage time distribution is again exponential (Fig. 3a,b in the main text). Given the exponential shape of the forgetting curve and τ c , we can evaluate the capacity n r , the number of memories which are retrievable at the steady state, as an integral over the forgetting curve (which equals τ c ), multiplied by the chance of a memory to get consolidated (or, equivalently, reach A f p at some time point): where p c is the probability of a memory efficacy to get consolidated and join the efficacies fluctuating around A f p .
We approximate p c by assuming that memories that don't consolidate are the ones freefalling exponentially from the initial point A 0 to the critical point A c . The time it takes is τ log(A 0 /A c ), and therefore, due to the Poisson statistics (we again use an intermediate point, due to the non homogeneous rate), Here F (x) is the basin of attraction size.
As we have seen above the long life time of consolidated memories as well as the memory capacity depend on the steady state critical efficacy A c and the fixed point efficacy A fp .
In fact, knowing A c , the fixed point efficacy can be determined by the fixed point equation , and is in general close to bλτ . Then, τ c (and therefore the capacity) can be determined by eq. (S25). To evaluate A c or equivalently the interference noise at steady state, we evaluate the second moment of all efficacies via the stationary efficacy histogram, Fig. 2c. One contribution is from the central 'bump' corresponding to consolidated memories fluctuating around the fixed point. These efficacies contribute approximately n r (A 2 fp +0.5A fp ) where the second term is the contribution of the Poisson variance around the mean. Another contribution is from memories with efficacies smaller than A c which decay to zero, hence contribute 0.5A 2 c τ . Finally, when 1 A fp there is a long tail outside the center peak, coming from the initial decay of the high efficacies toward the fixed point, contributing roughly 0.5τ (when A fp is not very small compared to unity, this term is negligible compared to the rest).
Thus, we approximate A c as Together with equations (S25), and (S27), we have a set of self consistent equations for the macroscopic variables of the theory. When consolidated memories provide the dominant contribution (which is the regime we are interested in, where the situation is qualitatively different than pure forgetting), the consolidation probability is close to unity and A f p ≈ bλτ , A c can be approximated as: Under these assumptions, τ c ≈ n r can be approximated as: The theory gives a good fit to simulation results (Fig. and Fig.4 in the main text) and allows us to observe the relations between model parameters (τ, λ, b, f, N ) and the system's behavior.

V. TRANSIENT SYNAPTIC DILUTION
In the main paper we present results for an ongoing dilution of a fraction of the synapses.
Here, we show what happens if there is a transient event of silencing (for example, a temporal blockage of a blood vessel, etc). In Fig.S13 we show the forgetting curve for a strong transient event lasting for 2τ time. There is a significant reduction in retrieval of memories entered just before the onset (temporally graded retrograde amnesia) and a tremendous reduction of retrieval probability for memories learned during the event. This is similar to the phenomenology of Transient Global Amnesia (TGA), where patients suffer from a whole day during which they have hard time retrieve past memories, and after the episode is over there is a strong amnesia of events that happened during the episode, and of events that happened just before the event.

VI. DETERMINISTIC MODEL
When replacing the Poisson rehearsals process by its average, a deterministic model for the memory efficacies arises (In this part we will use b = 1, and vary A 0 ): Initially all A's are zero, and so is ∆. At that phase, every A that enters grows towards A f p (which is initially λτ , and decrease slowly when ∆ increase). While A's accumulate at What will happen next depends on A 0 : for small enough A 0 , ∆ will grow until A s (∆) = A 0 . From that point on, all the new A's will decay exponentially, and all the past A's will stay at A f p , determined by the ∆ at that point (Fig. S15 top). We can write an equation relating ∆ to the number of A's that will stay in A f p : by: The first contribution comes from the memories accumulated at A f p (n is their number) and the second contribution comes from the decaying memories.
For large enough A 0 , ∆ will grow until A s (∆) = A f p (∆) < A 0 . When that happens, A f p looses its stability, and all the A's accumulated there will start decaying exponentially. That will cause ∆ to decrease, restoring the stability of A f p, and memories will start accumulating there again, and ∆ will start increasing again -only until A s (∆) = A f p (∆). This process will continue periodically forever, constantly storing new groups of memories at A f p and deleting them, replacing them with a new group (Fig.S15, bottom).
The value of the maximal ∆ will be determined by A s (∆) = A f p (∆), or: Solving these equations gives also the critical A value, where the bifurcation from fixed point behavior to a limit cycle oscillatory behavior as a function of A 0 happens.
In the stochastic case studied in the main paper, a noisy version of the fixed point phase is observed for small enough A 0 values. However, in the stochastic case the limit cycle phase doesn't exist, due to the large fluctuations around A f p . Nevertheless, if we replace the Poisson noise by a Gaussian noise with small enough amplitude, a noisy limit cycle behavior can be observed (not shown).